Project Overview

Current project tree

.
├── LICENSE
├── README.md
├── Rplots.pdf
├── cicd.png
├── config
│   ├── config.yaml
│   ├── samples.tsv
│   └── units.tsv
├── dags
│   ├── rulegraph.png
│   └── rulegraph.svg
├── data
│   └── metadata
├── images
│   ├── PRJNA477349_gps.html
│   ├── PRJNA477349_gps.png
│   ├── PRJNA477349_gps_files
│   ├── PRJNA477349_variable_freq.png
│   ├── PRJNA477349_variable_freq.svg
│   ├── PRJNA685168_variable_freq.png
│   ├── PRJNA685168_variable_freq.svg
│   ├── PRJNA802976_gps.html
│   ├── PRJNA802976_gps.png
│   ├── PRJNA802976_gps_files
│   ├── bkgd.png
│   ├── metadata.png
│   ├── project_tree.txt
│   ├── sample_gps.html
│   ├── sample_gps.png
│   ├── sample_gps_files
│   ├── smkreport
│   └── sra_run_selector.png
├── imap-sample-metadata.Rproj
├── index.Rmd
├── library
│   ├── apa.csl
│   ├── imap.bib
│   └── references.bib
├── report.html
├── resources
├── results
│   ├── PRJNA477349_read_size_asce.csv
│   ├── PRJNA477349_read_size_desc.csv
│   ├── PRJNA685168_read_size_asce.csv
│   └── PRJNA685168_read_size_desc.csv
├── styles.css
└── workflow
    ├── Snakefile
    ├── envs
    ├── reports
    ├── rules
    ├── schemas
    └── scripts

18 directories, 35 files



Current snakemake workflow




General overview

What is metadata?

  • Metadata is a set of data that describes and provides information about other data. It is commonly defined as data about data.
  • Sample metadata, as described in this book, refers to the description and context of the individual samples collected for a specific microbiome study.


Metadata structure

  • Metadata collected at different stages are typically organized in an Excel or Google spreadsheet where:
    • The metadata table columns represent the properties of the samples.
    • The table rows contain information associated with the samples.
    • Typically, the first column of sample metadata is the Sample ID, which serves as the key for each individual sample.
    • The Sample ID must be unique.
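
The uniqueness requirement can be checked before any downstream analysis. The following is a minimal Python sketch, assuming the metadata is a CSV table whose first column holds the sample ID. The function name find_duplicate_ids and the inline example table are hypothetical and are not part of this project's workflow.

```python
import csv
import io
from collections import Counter


def find_duplicate_ids(handle):
    """Return sample IDs that occur more than once in the first column."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    counts = Counter(row[0] for row in reader if row)
    return sorted(sample_id for sample_id, n in counts.items() if n > 1)


# Hypothetical metadata table: S2 appears twice, so it cannot act as a key.
example = io.StringIO("SampleID,Host\nS1,human\nS2,human\nS2,mouse\n")
duplicates = find_duplicate_ids(example)
print(duplicates)
```

An empty result means every sample ID is unique. For a file on disk, the same function could be called with a handle from open(path, newline="").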


Embedded metadata

  • In most cases, you will find the metadata detached from the experimental data.
  • Embedded metadata is integrated with the experimental data, which is especially useful for graphics.
  • Major microbiome analysis platforms require sample metadata, commonly referred to as a mapping file, when performing downstream analysis.


Explore SRA metadata

Brief overview

Typically, after sequencing the microbiome DNA, investigators are encouraged to deposit the sequence reads in a public repository. The Sequence Read Archive (SRA) is currently one of the most widely used databases for read information. A key advantage of the SRA is that it integrates data from the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).

Metadata via SRA Run Selector

Metadata associated with a specific project can be retrieved manually via the SRA Run Selector or by using the Entrez Direct (edirect) scripts.

  • Note that the SRA Run Selector names the downloaded metadata file SraRunTable.txt by default; for clarity, we will rename it to the NCBI BioProject ID with a .csv extension.
  • We will save the metadata file in the data/metadata/ folder.

Let’s create the folder (if it doesn’t exist!).
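
The folder creation and renaming steps could be sketched in Python as follows. This is an illustration under stated assumptions, not the project's own script: the helper save_run_table is hypothetical, the demonstration runs in a temporary directory, and the run accession in the stand-in file is a placeholder.

```python
import shutil
import tempfile
from pathlib import Path


def save_run_table(downloaded, bioproject, outdir):
    """Copy a Run Selector download to <outdir>/<bioproject>.csv."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = outdir / f"{bioproject}.csv"
    shutil.copyfile(downloaded, target)
    return target


# Demonstration in a temporary directory with a stand-in download.
# The run accession below is a placeholder, not a real record.
workdir = Path(tempfile.mkdtemp())
download = workdir / "SraRunTable.txt"
download.write_text("Run,BioProject\nSRR0000001,PRJNA477349\n")
saved = save_run_table(download, "PRJNA477349", workdir / "data" / "metadata")
print(saved.name)
```

Using exist_ok=True makes the step safe to repeat, which matches the "if it doesn't exist" condition above.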

We will explore more on sample metadata retrieved from four randomly selected microbiome BioProjects, including:

  1. PRJNA477349: 16S: rRNA from bushmeat samples collected from Tanzania Metagenome
  2. PRJNA802976: 16S: Changes to Gut Microbiota following Systemic Antibiotic Administration in Infants
  3. PRJNA685168: WGS: Multi-omics suggest diverse mechanisms for response to biologic therapies in IBD
  4. PRJEB21612: WGS: Alterations of the gut microbiome in hypertension


Example screenshot of the SRA Run Selector showing metadata associated with the NCBI-SRA BioProject number PRJNA477349



How many rows and columns

Knowing which variables are associated with the sample metadata helps in selecting the most important features for downstream analysis.
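
A row and column summary like the ones in this section can be produced with a few lines of code. The outputs shown in this section come from R; the Python sketch below is only an equivalent illustration, and the function table_shape and the tiny inline table are hypothetical.

```python
import csv
import io


def table_shape(handle):
    """Return (number of data rows, list of column names) for a CSV table."""
    reader = csv.reader(handle)
    columns = next(reader)  # the first line holds the column names
    n_rows = sum(1 for row in reader if row)  # ignore blank lines
    return n_rows, columns


# Tiny stand-in table; a real file would be opened with open(path, newline="").
example = io.StringIO("Run,Assay Type,Bases\nR1,AMPLICON,100\nR2,AMPLICON,200\n")
n_rows, columns = table_shape(example)
print(f"There are {n_rows} rows and {len(columns)} columns")
```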

PRJNA477349

[1] "There are 133 rows and 36 columns in PRJNA477349 metadata"
 [1] "Run"                            "Assay Type"                    
 [3] "AvgSpotLen"                     "Bases"                         
 [5] "BioProject"                     "BioSample"                     
 [7] "BioSampleModel"                 "Bytes"                         
 [9] "Center Name"                    "Collection_Date"               
[11] "Consent"                        "DATASTORE filetype"            
[13] "DATASTORE provider"             "DATASTORE region"              
[15] "Experiment"                     "geo_loc_name_country"          
[17] "geo_loc_name_country_continent" "geo_loc_name"                  
[19] "Host"                           "Instrument"                    
[21] "Isolate"                        "Lat_Lon"                       
[23] "Library Name"                   "LibraryLayout"                 
[25] "LibrarySelection"               "LibrarySource"                 
[27] "Organism"                       "Platform"                      
[29] "ReleaseDate"                    "create_date"                   
[31] "version"                        "Sample Name"                   
[33] "SRA Study"                      "collection_season"             
[35] "sample_condition"               "SampleCode"                    


PRJNA802976

[1] "There are 54 rows and 35 columns in PRJNA802976 metadata"
 [1] "Run"                            "Assay Type"                    
 [3] "AvgSpotLen"                     "Bases"                         
 [5] "BioProject"                     "BioSample"                     
 [7] "BioSampleModel"                 "Bytes"                         
 [9] "Center Name"                    "Collection_Date"               
[11] "Consent"                        "DATASTORE filetype"            
[13] "DATASTORE provider"             "DATASTORE region"              
[15] "Experiment"                     "geo_loc_name_country"          
[17] "geo_loc_name_country_continent" "geo_loc_name"                  
[19] "Host"                           "Instrument"                    
[21] "Isolation_Source"               "Lat_Lon"                       
[23] "Library Name"                   "LibraryLayout"                 
[25] "LibrarySelection"               "LibrarySource"                 
[27] "Organism"                       "Platform"                      
[29] "ReleaseDate"                    "create_date"                   
[31] "version"                        "samp_collect_device"           
[33] "samp_size"                      "Sample Name"                   
[35] "SRA Study"                     

PRJNA685168

[1] "There are 114 rows and 57 columns in PRJNA685168 metadata"
 [1] "Run"                            "Age"                           
 [3] "Antibiotics"                    "Assay Type"                    
 [5] "AvgSpotLen"                     "Bases"                         
 [7] "Biologic"                       "BioProject"                    
 [9] "BioSample"                      "BioSampleModel"                
[11] "BMI"                            "Bytes"                         
[13] "Center Name"                    "Collection_Date"               
[15] "Consent"                        "DATASTORE filetype"            
[17] "DATASTORE provider"             "DATASTORE region"              
[19] "Duration"                       "env_broad_scale"               
[21] "env_local_scale"                "env_medium"                    
[23] "Experiment"                     "geo_loc_name_country"          
[25] "geo_loc_name_country_continent" "geo_loc_name"                  
[27] "Host"                           "ibd_subtype"                   
[29] "Immunomodulator"                "Instrument"                    
[31] "Lat_Lon"                        "Library Name"                  
[33] "LibraryLayout"                  "LibrarySelection"              
[35] "LibrarySource"                  "Organism"                      
[37] "Platform"                       "PriorTNF"                      
[39] "Read_Count"                     "ReleaseDate"                   
[41] "create_date"                    "version"                       
[43] "Sample_id"                      "Sample Name"                   
[45] "sex"                            "Smoking"                       
[47] "Source"                         "SRA Study"                     
[49] "Steriods"                       "Wk14_remission"                
[51] "Wk52_remission"                 "HBI"                           
[53] "behavior"                       "Location"                      
[55] "Endo_remission"                 "SCCAI"                         
[57] "Extent"                        


PRJEB21612

[1] "There are 117 rows and 49 columns in PRJEB21612 metadata"
 [1] "Run"                                     
 [2] "Assay Type"                              
 [3] "AvgSpotLen"                              
 [4] "Bases"                                   
 [5] "BioProject"                              
 [6] "BioSample"                               
 [7] "Bytes"                                   
 [8] "Center Name"                             
 [9] "Collection_Date"                         
[10] "Consent"                                 
[11] "DATASTORE filetype"                      
[12] "DATASTORE provider"                      
[13] "DATASTORE region"                        
[14] "ENA-FIRST-PUBLIC (run)"                  
[15] "ENA_first_public"                        
[16] "ENA-LAST-UPDATE (run)"                   
[17] "ENA-LAST-UPDATE"                         
[18] "environment_(biome)"                     
[19] "environment_(feature)"                   
[20] "environment_(material)"                  
[21] "Experiment"                              
[22] "External_Id"                             
[23] "geo_loc_name_country"                    
[24] "geo_loc_name_country_continent"          
[25] "geographic_location_(country_and/or_sea)"
[26] "geographic_location_(latitude)"          
[27] "geographic_location_(longitude)"         
[28] "human_gut_environmental_package"         
[29] "INSDC_center_alias"                      
[30] "INSDC_center_name"                       
[31] "INSDC_first_public"                      
[32] "INSDC_last_update"                       
[33] "INSDC_status"                            
[34] "Instrument"                              
[35] "Investigation_type"                      
[36] "LibraryLayout"                           
[37] "LibrarySelection"                        
[38] "LibrarySource"                           
[39] "Organism"                                
[40] "Platform"                                
[41] "project_name"                            
[42] "ReleaseDate"                             
[43] "create_date"                             
[44] "version"                                 
[45] "Sample Name"                             
[46] "Sample_Name"                             
[47] "Sequencing_method"                       
[48] "SRA Study"                               
[49] "Submitter_Id"                            



Run info using Entrez esearch function

The run info returned by efetch contains 47 standardized columns for each BioProject.

esearch -db sra -query 'PRJNA477349[bioproject]' | efetch -format runinfo > data/metadata/runinfo_PRJNA477349.csv

esearch -db sra -query 'PRJNA802976[bioproject]' | efetch -format runinfo > data/metadata/runinfo_PRJNA802976.csv

esearch -db sra -query 'PRJNA685168[bioproject]' | efetch -format runinfo > data/metadata/runinfo_PRJNA685168.csv

esearch -db sra -query 'PRJEB21612[bioproject]' | efetch -format runinfo > data/metadata/runinfo_PRJEB21612.csv



Explore RunInfo from Entrez esearch

          RunInfoColumns EqualAllProjects
1                    Run             TRUE
2            ReleaseDate             TRUE
3               LoadDate             TRUE
4                  spots             TRUE
5                  bases             TRUE
6       spots_with_mates             TRUE
7              avgLength             TRUE
8                size_MB             TRUE
9           AssemblyName             TRUE
10         download_path             TRUE
11            Experiment             TRUE
12           LibraryName             TRUE
13       LibraryStrategy             TRUE
14      LibrarySelection             TRUE
15         LibrarySource             TRUE
16         LibraryLayout             TRUE
17            InsertSize             TRUE
18             InsertDev             TRUE
19              Platform             TRUE
20                 Model             TRUE
21              SRAStudy             TRUE
22            BioProject             TRUE
23       Study_Pubmed_id             TRUE
24             ProjectID             TRUE
25                Sample             TRUE
26             BioSample             TRUE
27            SampleType             TRUE
28                 TaxID             TRUE
29        ScientificName             TRUE
30            SampleName             TRUE
31          g1k_pop_code             TRUE
32                source             TRUE
33    g1k_analysis_group             TRUE
34            Subject_ID             TRUE
35                   Sex             TRUE
36               Disease             TRUE
37                 Tumor             TRUE
38      Affection_Status             TRUE
39          Analyte_Type             TRUE
40     Histological_Type             TRUE
41             Body_Site             TRUE
42            CenterName             TRUE
43            Submission             TRUE
44 dbgap_study_accession             TRUE
45               Consent             TRUE
46               RunHash             TRUE
47              ReadHash             TRUE
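
The table above marks every run-info column as TRUE for all projects. A comparison of this kind could be sketched as below, assuming EqualAllProjects means that the column is present in every project's run info. The function columns_shared_by_all and the two short header lists are hypothetical; in practice the headers would be read from the runinfo_*.csv files.

```python
def columns_shared_by_all(columns_by_project):
    """Map each column name to True if every project contains it."""
    all_columns = []
    for cols in columns_by_project.values():
        for col in cols:
            if col not in all_columns:
                all_columns.append(col)  # keep first-seen order
    return {
        col: all(col in cols for cols in columns_by_project.values())
        for col in all_columns
    }


# Hypothetical headers; ProjectB lacks "spots" on purpose.
headers = {
    "ProjectA": ["Run", "ReleaseDate", "spots"],
    "ProjectB": ["Run", "ReleaseDate"],
}
shared = columns_shared_by_all(headers)
for column, in_all in shared.items():
    print(f"{column:>12} {in_all}")
```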



Demo with PRJNA477349 metadata

The PRJNA477349 metadata contains latitude and longitude information, which makes it possible to drop pins on the collection sites.





Demo with PRJNA685168 metadata

PRJNA685168 is an IBD study of responses to biologic therapies; its metadata contains sex and age features.



Demo with PRJNA802976 metadata

PRJNA802976 is a 16S study of changes to the gut microbiota following systemic antibiotic administration in infants. Its metadata includes a Lat_Lon column, so the collection sites can be mapped.



Panning and mapping sampling points

The metadata must contain latitude and longitude columns, or a combined Lat_Lon column, for the collection points.

  • The leaflet R package can drop a pin on each corresponding coordinate.
  • Note that samples collected at the same coordinates will overlap.
  • You can zoom in or out to enlarge or shrink the map.
  • You can also hover over a pin to see the variable label.
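
Mapping tools generally expect signed decimal degrees, so a combined Lat_Lon value has to be split first. The sketch below assumes the values follow the NCBI BioSample convention of decimal degrees followed by a hemisphere letter, for example "38.98 N 77.11 W". The function parse_lat_lon is hypothetical, and the coordinates in the example are made up rather than taken from the projects above.

```python
def parse_lat_lon(value):
    """Convert 'DD.DD N|S DDD.DD E|W' to signed (latitude, longitude)."""
    parts = value.split()
    if len(parts) != 4:
        raise ValueError(f"unexpected Lat_Lon value: {value!r}")
    lat, ns, lon, ew = parts
    ns, ew = ns.upper(), ew.upper()
    if ns not in ("N", "S") or ew not in ("E", "W"):
        raise ValueError(f"unexpected hemisphere in: {value!r}")
    # South and West become negative in signed decimal degrees.
    latitude = float(lat) * (-1 if ns == "S" else 1)
    longitude = float(lon) * (-1 if ew == "W" else 1)
    return latitude, longitude


# Made-up coordinates used only to show the conversion.
print(parse_lat_lon("6.5 S 35.0 E"))
```

Rows with missing or free-text values raise a ValueError here, so they can be filtered out before plotting.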



Bioproject: PRJNA477349



Bioproject: PRJNA802976






References

[1]
In-GitHub. (2023). Official repository for Citation Style Language (CSL). Accessed on February 06, 2023. Retrieved from https://github.com/citation-style-language/styles



Appendix

Static Snakemake report

The interactive Snakemake HTML report can be viewed by opening report.html in any compatible browser. You will be able to explore the workflow and the associated statistics. You can also close the left bar to get a wider view of the display.



Troubleshooting

  1. CiteprocXMLError: Missing root element
    • The CSL file may be empty. Example Citation Style Language files are available on GitHub [1].
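
A quick way to confirm this diagnosis is to test whether the CSL file is empty or lacks a root element. The following Python sketch assumes that a CSL style is an XML document whose root element is named style. The helper check_csl and the two temporary files are hypothetical, and the check covers only the root element, not full validity of the style.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def check_csl(path):
    """Return a short diagnostic string for a CSL file."""
    path = Path(path)
    if path.stat().st_size == 0:
        return "empty file"
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError as err:
        return f"not well-formed XML: {err}"
    # Drop the XML namespace prefix, if any, before comparing the tag name.
    if root.tag.split("}")[-1] != "style":
        return f"unexpected root element: {root.tag}"
    return "ok"


# Two stand-in files: one empty, one with only a root element.
workdir = Path(tempfile.mkdtemp())
empty = workdir / "empty.csl"
empty.write_text("")
minimal = workdir / "minimal.csl"
minimal.write_text('<style xmlns="http://purl.org/net/xbiblio/csl"/>')
print(check_csl(empty), "|", check_csl(minimal))
```

In this project the file to check would be library/apa.csl.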